AITopics

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)
(13 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Information Processing SystemsFeb-13-2026, 01:35:16 GMT

fb8fe6b79288f3d83696a5d276f4fc9d-Supplemental-Conference.pdf

assumption 1, inequality, lemma, (14 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

Merrill, William, Sabharwal, Ashish

Exact Expressive Power of Transformers with Padding

arXiv.org Artificial IntelligenceNov-7-2025

Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{FO}$-uniform $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, polynomially padded transformers recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought for test-time compute.

large language model, machine learning, natural language, (19 more...)

2505.18948

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Neural Information Processing SystemsOct-10-2025, 13:20:38 GMT

af58a33861ac45472ea1cc5860d2b13e-Paper-Conference.pdf

sequence, uhat, vector, (16 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)
(13 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Information Processing SystemsAug-19-2025, 21:46:32 GMT

Appendix for " Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks " A Lemmas

In this section, we collect several lemmas useful for our analysis. The proof is completed.Lemma A.2. Let W, W According to Taylor's theorem, there exists α [0, 1] such that ℓ(W; z) ℓ(W The proof is completed.The following lemma shows the self-bounding property of smooth and nonnegative functions. Let Assumptions 1, 2 hold. Let Assumptions 1, 2 hold. The remaining arguments in proving Lemma A.6 is the same as proving Lemma 5 in [ Let Assumptions 1, 2 hold.

artificial intelligence, lemma, machine learning, (15 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

arXiv.org Artificial IntelligenceAug-12-2025

Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria

Cao, Yang, Chen, Yubin, Song, Zhao, Zhang, Jiahao

Generative modelling has seen significant advances through simulation-free paradigms such as Flow Matching, and in particular, the MeanFlow framework, which replaces instantaneous velocity fields with average velocities to enable efficient single-step sampling. In this work, we introduce a theoretical study on Second-Order MeanFlow, a novel extension that incorporates average acceleration fields into the MeanFlow objective. We first establish the feasibility of our approach by proving that the average acceleration satisfies a generalized consistency condition analogous to first-order MeanFlow, thereby supporting stable, one-step sampling and tractable loss functions. We then characterize its expressivity via circuit complexity analysis, showing that under mild assumptions, the Second-Order MeanFlow sampling process can be implemented by uniform threshold circuits within the $\mathsf{TC}^0$ class. Finally, we derive provably efficient criteria for scalable implementation by leveraging fast approximate attention computations: we prove that attention operations within the Second-Order MeanFlow architecture can be approximated to within $1/\mathrm{poly}(n)$ error in time $n^{2+o(1)}$. Together, these results lay the theoretical foundation for high-order flow matching models that combine rich dynamics with practical sampling efficiency.

machine learning, meanflow, natural language, (17 more...)

2508.07102

Country: North America > United States (0.46)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Liu, Yuxi, Preechakul, Konpat, Kuwaranancharoen, Kananart, Bai, Yutong

The Serial Scaling Hypothesis

arXiv.org Machine LearningJul-18-2025

While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems-from mathematical reasoning to physical simulations to sequential decision-making-require dependent computational steps that cannot be parallelized. Drawing from complexity theory, we formalize this distinction and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. We argue that recognizing the serial nature of computation holds profound implications on machine learning, model design, hardware development. As AI tackles increasingly complex reasoning, deliberately scaling serial computation-not just parallel computation-is essential for continued progress.

large language model, machine learning, natural language, (17 more...)

arXiv.org Machine Learning

2507.12549

Country:

North America > United States (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.65)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.67)

arXiv.org Artificial IntelligenceJul-18-2025

Perfect diffusion is $\mathsf{TC}^0$ -- Bad diffusion is Turing-complete

Liu, Yuxi

This paper explores the computational complexity of diffusion-based language modeling. We prove a dichotomy based on the quality of the score-matching network in a diffusion model. In one direction, a network that exactly computes the score function of some initial distribution can only perform language modeling within the $\mathsf{TC}^0$ complexity class, reflecting limitations tied to rapid convergence. In the other direction, we show that if there is no requirement for the network to match any score function, then diffusion modeling can simulate any Turing machine in a certain sense. This dichotomy provides a theoretical lens on the capabilities and limitations of diffusion models, particularly concerning tasks requiring sequential computation. We conjecture extensions of our theoretical results, including for the case where the diffusion model is not perfect, but merely good. We also discuss the wider context and practical implications, and hypothesize that a machine learning architecture that can interpolate between sequential and parallel modes of operation would be superior to both Transformers and diffusion models.

diffusion model, machine learning, natural language, (19 more...)

2507.12469

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)

arXiv.org Artificial IntelligenceMay-29-2025

Can Test-time Computation Mitigate Memorization Bias in Neural Symbolic Regression?

Sato, Shun, Sato, Issei

Symbolic regression aims to discover mathematical equations that fit given numerical data. It has been applied in various fields of scientific research, such as producing human-readable expressions that explain physical phenomena. Recently, Neural symbolic regression (NSR) methods that involve Transformers pre-trained on large-scale synthetic datasets have gained attention. While these methods offer advantages such as short inference time, they suffer from low performance, particularly when the number of input variables is large. In this study, we hypothesized that this limitation stems from the memorization bias of Transformers in symbolic regression. We conducted a quantitative evaluation of this bias in Transformers using a synthetic dataset and found that Transformers rarely generate expressions not present in the training data. Additional theoretical analysis reveals that this bias arises from the Transformer's inability to construct expressions compositionally while verifying their numerical validity. We finally examined if tailoring test-time strategies can lead to reduced memorization bias and better performance. We empirically demonstrate that providing additional information to the model at test time can significantly mitigate memorization bias. On the other hand, we also find that reducing memorization bias does not necessarily correlate with improved performance. These findings contribute to a deeper understanding of the limitations of NSR approaches and offer a foundation for designing more robust, generalizable symbolic regression methods. Code is available at https://github.com/Shun-0922/Mem-Bias-NSR .

artificial intelligence, expression, machine learning, (16 more...)

2505.22081

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Memory-Based Learning > Rote Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

London, Charles, Kanade, Varun

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

arXiv.org Artificial IntelligenceMay-28-2025

Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.

large language model, machine learning, natural language, (21 more...)

2505.21024

Country: Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)